[Core][aDag] Support multi node multi reader #47480

rkooo567 · 2024-09-04T15:41:53Z

Why are these changes needed?

This PR adds support for multiple readers across multiple nodes. It also adds tests verifying that the feature works with large gRPC payloads and with buffer resizing.

Previously, multiple readers on multiple nodes did not work because the code allowed registering only one remote reader reference, on one specific node. This PR fixes the issue by allowing remote reader references to be registered on multiple nodes.
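
As a rough illustration of the limitation described above, here is a minimal sketch that contrasts a single reader slot with a per-node mapping. It is not Ray's code: the class names, the string stand-ins for node IDs and object refs, and the choice to raise on a second node are all assumptions made for the example.

```python
from typing import Dict, Optional


class SingleReaderRegistry:
    """Hypothetical 'before' model: one slot for one remote reader on one node."""

    def __init__(self) -> None:
        self.node_id: Optional[str] = None
        self.reader_ref: Optional[str] = None

    def register(self, node_id: str, reader_ref: str) -> None:
        # A second node cannot be represented, so reject it (assumed behavior).
        if self.node_id is not None and self.node_id != node_id:
            raise ValueError("only one remote reader node is supported")
        self.node_id = node_id
        self.reader_ref = reader_ref


class MultiNodeReaderRegistry:
    """Hypothetical 'after' model: one reader reference per remote node."""

    def __init__(self) -> None:
        self.node_id_to_reader_ref: Dict[str, str] = {}

    def register(self, node_id: str, reader_ref: str) -> None:
        # Each node keeps its own entry, so any number of nodes can register.
        self.node_id_to_reader_ref[node_id] = reader_ref
```

With the mapping, registering readers on `node-a` and `node-b` keeps both entries, whereas the single-slot model has nowhere to store the second one.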

Related issue number

Closes #46269

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

rkooo567 · 2024-09-06T08:36:44Z

python/ray/dag/tests/experimental/test_accelerated_dag.py

@@ -1448,59 +1447,6 @@ def test_driver_and_actor_as_readers(ray_start_cluster):
        dag.experimental_compile()


-def test_payload_large(ray_start_cluster):


Moved to the multi-node test suite.

rkooo567 · 2024-09-06T08:36:53Z

python/ray/experimental/channel/shared_memory_channel.py

@@ -16,7 +17,7 @@
 # entry/init points.
 logger = logging.getLogger(__name__)

-DEFAULT_MAX_BUFFER_SIZE = int(100 * 1e6)  # 100 mB


This was a bug.

I think there's a more fundamental fix. I will do it in a separate PR.

Did you mean the buffer size should be 1MB? If that's the case, can you update the comment as well?

IIUC, we have this 1MB buffer size (ray/python/ray/dag/context.py, line 60 in 3e8dd0d):

buffer_size_bytes: int = DEFAULT_BUFFER_SIZE_BYTES

But it was not passed correctly (meaning our default buffer size has been 100MB).
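
The kind of bug discussed in this thread can be sketched as follows. This is a simplified model, not the actual Ray code: `Context`, `create_channel_buggy`, and `create_channel_fixed` are hypothetical names, and the 1 MB value for `DEFAULT_BUFFER_SIZE_BYTES` is taken from the comment above rather than from the source file.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_BUFFER_SIZE_BYTES = int(1e6)      # 1 MB, the intended default (assumed value)
DEFAULT_MAX_BUFFER_SIZE = int(100 * 1e6)  # 100 MB, the channel-level fallback


@dataclass
class Context:
    """Hypothetical stand-in for the DAG context that holds the buffer size."""

    buffer_size_bytes: int = DEFAULT_BUFFER_SIZE_BYTES


def create_channel_buggy(ctx: Context, buffer_size_bytes: Optional[int] = None) -> int:
    # The context value is never consulted, so the 100 MB fallback always wins
    # unless the caller passes a size explicitly.
    if buffer_size_bytes is None:
        return DEFAULT_MAX_BUFFER_SIZE
    return buffer_size_bytes


def create_channel_fixed(ctx: Context, buffer_size_bytes: Optional[int] = None) -> int:
    # Fall back to the configured context value instead of the module constant.
    if buffer_size_bytes is None:
        return ctx.buffer_size_bytes
    return buffer_size_bytes
```

In this model the buggy path yields 100,000,000 bytes for a default context, while the fixed path yields the configured 1,000,000 bytes.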

rkooo567 · 2024-09-06T08:38:47Z

python/ray/experimental/channel/shared_memory_channel.py

                timeout_ms,
            )
+            # TODO(sang): Clean the previous ref that won't be used.


This currently leaks a thread whenever resizing happens. We should fix it.

ruisearch42 · 2024-09-06T16:10:11Z

python/ray/experimental/channel/shared_memory_channel.py

@@ -16,7 +17,7 @@
 # entry/init points.
 logger = logging.getLogger(__name__)

-DEFAULT_MAX_BUFFER_SIZE = int(100 * 1e6)  # 100 mB


Did you mean the buffer size should be 1MB? If that's the case, can you update the comment as well?

ruisearch42 · 2024-09-06T16:15:06Z

python/ray/experimental/channel/shared_memory_channel.py

+        _reader_node_ids: Optional[Set["ray.NodeID"]] = None,
        _writer_ref: Optional["ray.ObjectRef"] = None,
-        _reader_ref: Optional["ray.ObjectRef"] = None,
+        _reader_refs: Optional[Dict[str, "ray.ObjectRef"]] = None,


Can you please update the docstring?

There's a comment underneath, and given that other private args don't have docstrings, I will keep it as is. Let me know if you think we should update the docstring for all private attributes.

Yeah, I think we should move it to the docstring so that the caller knows what to pass in. There are also args that seem to contain overlapping information, which need cleanup or clarification.

ruisearch42 · 2024-09-09T16:09:28Z

python/ray/experimental/channel/shared_memory_channel.py

-        self._reader_ref = reader_ref
+    def __init__(
+        self,
+        _node_id_to_reader_info: Dict[str, ReaderInfo] = None,


My bad. It should not be None.

ruisearch42 · 2024-09-09T16:50:18Z

python/ray/experimental/channel/shared_memory_channel.py

+            self._reader_node_ids = _reader_node_ids
+            self._node_id_to_reader_info = _node_id_to_reader_info
+
+        assert self._num_local_readers == 0


This is a bit weird: it is set to 0 at L233 and asserted here. Why not just set it here?

rkooo567 · 2024-09-09T17:00:33Z

All comments are addressed. Premerge is passing.

ruisearch42 · 2024-09-09T17:14:40Z

python/ray/experimental/channel/shared_memory_channel.py

+            remote_reader_refs,
+            remote_reader_node_ids,
+            remote_reader_ids,
+            remote_num_readers_per_node,


Can we define a struct for each "node reader"? It is less error-prone, and we don't need the assert on L357.
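
One way to read this suggestion is sketched below: group the four parallel lists into one record per node, so the lengths cannot drift apart after construction. `NodeReaderInfo` and `build_node_reader_info` are hypothetical names, and the field types are assumptions. A later revision in this thread uses a `ReaderInfo` type whose fields are not shown here.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class NodeReaderInfo:
    """Hypothetical per-node record replacing four parallel lists."""

    reader_ref: str        # stand-in for the remote reader's object ref
    reader_ids: List[str]  # stand-ins for the reader actor IDs on this node
    num_readers: int


def build_node_reader_info(
    remote_reader_refs: Sequence[str],
    remote_reader_node_ids: Sequence[str],
    remote_reader_ids: Sequence[Sequence[str]],
    remote_num_readers_per_node: Sequence[int],
) -> Dict[str, NodeReaderInfo]:
    # Validate once at the boundary; afterwards each node has a single record.
    lengths = {
        len(remote_reader_refs),
        len(remote_reader_node_ids),
        len(remote_reader_ids),
        len(remote_num_readers_per_node),
    }
    if len(lengths) != 1:
        raise ValueError("per-node reader lists must have equal length")
    return {
        node_id: NodeReaderInfo(ref, list(ids), num)
        for node_id, ref, ids, num in zip(
            remote_reader_node_ids,
            remote_reader_refs,
            remote_reader_ids,
            remote_num_readers_per_node,
        )
    }
```

Callers then pass a single `Dict[str, NodeReaderInfo]` instead of four positional lists, which removes the need for a separate length assertion at each use site.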

rkooo567 · 2024-09-09T21:24:33Z

cc @ruisearch42 Do you think we can merge this today?

ruisearch42

Looks good

ruisearch42 · 2024-09-10T00:52:40Z

src/ray/core_worker/experimental_mutable_object_provider.cc

+  std::shared_ptr<std::vector<std::shared_ptr<MutableObjectReaderInterface>>>
+      remote_readers =
+          std::make_shared<std::vector<std::shared_ptr<MutableObjectReaderInterface>>>();
+  // TODO(sang): Currently, these attributes are not cleaned up.


Which attributes?

ruisearch42 · 2024-09-10T01:12:25Z

src/ray/core_worker/experimental_mutable_object_provider.cc

+            RAY_LOG(ERROR)
+                << "Failed to transfer object to a remote node for an object id "
+                << writer_object_id << ". It can cause hang.";
+          }


Should we have a hard failure here?

ruisearch42 · 2024-09-10T01:19:32Z

python/ray/experimental/channel/shared_memory_channel.py

+                self._worker.core_worker.experimental_channel_register_reader(
+                    reader_ref_info.reader_ref,
+                )


Do we need to assert that this is called exactly once?
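
A minimal sketch of such a guard, assuming the registration call can be wrapped by the caller. `ReaderRegistrar` is a hypothetical helper, not part of Ray, and `register_fn` stands in for a call like `experimental_channel_register_reader`. It raises an exception rather than using `assert`, since assertions are skipped when Python runs with `-O`.

```python
from typing import Callable, Set


class ReaderRegistrar:
    """Hypothetical wrapper that allows each reader ref to be registered once."""

    def __init__(self, register_fn: Callable[[str], None]) -> None:
        self._register_fn = register_fn
        self._registered: Set[str] = set()

    def register(self, reader_ref: str) -> None:
        # Fail loudly on a repeated registration instead of silently re-registering.
        if reader_ref in self._registered:
            raise RuntimeError(f"reader {reader_ref!r} was already registered")
        self._register_fn(reader_ref)
        self._registered.add(reader_ref)
```

The first call for a given ref goes through to `register_fn`, and any repeat raises before the underlying call is made.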

rkooo567 added 6 commits August 30, 2024 23:22

- 2ad93e7: .
- 55ff85d: Merge branch 'master' into multi-node-multi-reader
- e1b9fef: ip
- 5ea646b: .
- 9bd2e7e: working
- d947ccb: done

rkooo567 assigned kevin85421 and ruisearch42 Sep 6, 2024

rkooo567 commented Sep 6, 2024

rkooo567 changed the title ~~[wip][aDag] Support multi node multi reader~~ [Core][aDag] Support multi node multi reader Sep 6, 2024

rkooo567 added the "go" label (add ONLY when ready to merge, run all tests) Sep 6, 2024

ruisearch42 reviewed Sep 6, 2024

rkooo567 added 3 commits September 9, 2024 00:09

- d7f9433: Addressed code review.
- 5bf9939: Merge branch 'master' into multi-node-multi-reader
- e3dac1d: working now.

ruisearch42 reviewed Sep 9, 2024

- b07715c: done

ruisearch42 reviewed Sep 9, 2024

rkooo567 added 2 commits September 9, 2024 12:25

- 5b35173: done
- 26ffedf: .

ruisearch42 approved these changes Sep 10, 2024

rkooo567 added 3 commits September 9, 2024 20:17

- 5eb30fa: Merge branch 'master' into multi-node-multi-reader
- 14a47f9: fixed
- b6ed67e: .

rkooo567 enabled auto-merge (squash) September 10, 2024 16:56

rkooo567 merged commit 57136b5 into ray-project:master Sep 10, 2024
6 checks passed

rkooo567 pushed a commit to rkooo567/ray that referenced this pull request Sep 11, 2024

- 1333466: Revert "[Core][aDag] Support multi node multi reader (ray-project#47480)". This reverts commit 57136b5.
